Interface for eigensolver on QUDA #561

Merged: 32 commits into quda_work on Oct 4, 2023
Conversation

@aniketsen (Contributor) commented May 9, 2023

eigsolveQuda acts as an interface for calling the eigensolver in QUDA. In phmc.c, this interface is called based on the input parameter UseExternalEigSolver, which is stored in the monomial parameter external_eigsolver. The interface also initializes QUDA and the two-flavour solver.
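Schematically, the call site looks like this (a sketch only: the full argument list of eigsolveQuda is omitted here, and the exact test on external_eigsolver is an assumption for illustration):

if(mnl->external_eigsolver) { /* set via the UseExternalEigSolver input parameter */
  /* eigsolveQuda initializes QUDA on first use and sets up the two-flavour
   * solver before running the QUDA eigensolver; maxmin selects the largest (1)
   * or smallest (0) eigenvalue */
  double ev_min = eigsolveQuda(/* maxmin = */ 0 /* , ... */);
  double ev_max = eigsolveQuda(/* maxmin = */ 1 /* , ... */);
}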

@kostrzewa (Member) commented May 24, 2023

I tested this on a 32c64 lattice and, as far as I can tell, it works correctly. However, I think the time to solution for the lowest eigenvalue can be reduced significantly. On the machine in question (a single quad-A100 node), the solver takes around 200 seconds for the lowest eigenvalue and around 10 seconds for the largest one (that measurement was at too low a precision; at full precision it is actually over 450 seconds in total). The CPU solver, on just 2x16 AMD EPYC2 cores, takes "only" 1000 seconds by comparison. I thus suspect that the calculation of the lowest eigenvalue can be sped up by at least another factor of 10. If you have the time and the inclination, @aniketsen, it would be great if you could explore the polynomial acceleration a little. The documentation of the eigensolver on the QUDA wiki is not bad as far as I can tell.

My guess is that appropriate settings for a_min and a_max (depending on whether one is looking for the smallest or the largest eigenvalue) will speed this up massively. I'm not sure about the order of the polynomial that one should use (poly_deg).
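For reference, my understanding of the underlying Chebyshev construction (a textbook summary, not lifted from the QUDA source): the window [a_min, a_max] is mapped onto [-1, 1], where the degree-k Chebyshev polynomial T_k is bounded by one, while it grows rapidly outside,

\hat{\lambda} = \frac{2\lambda - a_{\min} - a_{\max}}{a_{\max} - a_{\min}},
\qquad p(\lambda) = T_k(\hat{\lambda}),
\qquad |T_k(\hat{\lambda})| \le 1 \quad \text{for } \lambda \in [a_{\min}, a_{\max}].

Eigenvalues below a_min are thus strongly amplified in p(A), so the smallest eigenvalues of the operator become the largest ones of the filtered operator; this is the SR <-> LR exchange visible in the logs below.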

@kostrzewa (Member) commented:

No poly accel:

# TM_QUDA: mu = 0.138390000000, epsilon = 0.146560000000 kappa = 0.140005000000, csw = 1.711200000000
# QUDA: Orthonormalising initial guess
# QUDA: ********************************
# QUDA: **** START QUDA EIGENSOLVER ****
# QUDA: ********************************
# QUDA: TRLM computed the requested 1 vectors in 412 restart steps and 19824 OP*x operations.
# QUDA: Eval[0000] = (+1.1364231284241534e-04,-5.9654390341346649e-22) ||+1.1364231284241534e-04|| Residual = +1.3907589993294887e-14
# QUDA: ********************************
# QUDA: ***** END QUDA EIGENSOLVER *****
# QUDA: ********************************
# TM_QUDA: Time for eigsolveQuda 4.568580e+02 s level: 2 proc_id: 0 /HMC/ndcloverrat1:ndrat_heatbath/eigsolveQuda
# TM_QUDA: Using single prec. as sloppy!
# TM_QUDA: Using double-half refinement in mshift-solver!
# TM_QUDA: Called _loadGaugeQuda for gauge_id: 0.000000
# TM_QUDA: Theta boundary conditions will be applied to gauge field
# TM_QUDA: Using mixed precision CG!
# TM_QUDA: Using EO preconditioning!
# TM_QUDA: Time for loadCloverQuda 1.272554e-02 s level: 3 proc_id: 0 /HMC/ndcloverrat1:ndrat_heatbath/eigsolveQuda/loadCloverQuda
# TM_QUDA: mu = 0.138390000000, epsilon = 0.146560000000 kappa = 0.140005000000, csw = 1.711200000000
# QUDA: Orthonormalising initial guess
# QUDA: ********************************
# QUDA: **** START QUDA EIGENSOLVER ****
# QUDA: ********************************
# QUDA: TRLM computed the requested 1 vectors in 4 restart steps and 240 OP*x operations.
# QUDA: Eval[0000] = (+3.0261608378473412e+00,-6.2242957455787945e-17) ||+3.0261608378473412e+00|| Residual = +1.0169756438652394e-14
# QUDA: ********************************
# QUDA: ***** END QUDA EIGENSOLVER *****
# QUDA: ********************************
# TM_QUDA: Time for eigsolveQuda 5.180968e+00 s level: 2 proc_id: 0 /HMC/ndcloverrat1:ndrat_heatbath/eigsolveQuda

smallest: 457 seconds
largest: 5 seconds

TOTAL: 462 seconds (compared to over 1000 seconds on the CPU for this particular ensemble and machine)

untuned poly accel:

(note that poly acceleration exchanges SR <-> LR)

# QUDA: ********************************
# QUDA: **** START QUDA EIGENSOLVER ****
# QUDA: ********************************
# QUDA: spectrum LR
# QUDA: tol 1.0000e-14
# QUDA: n_conv 1
# QUDA: n_ev 1
# QUDA: n_kr 96
# QUDA: polyDeg 128
# QUDA: a-min 0.001000
# QUDA: a-max 4.000000
# QUDA: Resizing kSpace to 144 vectors
# QUDA: 0000 converged eigenvalues at restart iter 0001
# QUDA: 0000 converged eigenvalues at restart iter 0002
# QUDA: Resizing kSpace to 145 vectors
# QUDA: 0001 converged eigenvalues at restart iter 0003
# QUDA: TRLM computed the requested 1 vectors in 3 restart steps and 192 OP*x operations.
# QUDA: RitzValue[0000]: (+7.9066095102425749e-01, +0.0000000000000000e+00) residual 4.0309354968575832e-17
# QUDA: Eval[0000] = (+1.1364231284240960e-04,+1.4837671236894835e-21) ||+1.1364231284240960e-04|| Residual = +1.7886132427934613e-15
# QUDA: ********************************
# QUDA: ***** END QUDA EIGENSOLVER *****
# QUDA: ********************************
# TM_QUDA: Time for eigsolveQuda 7.718927e+01 s level: 2 proc_id: 0 /HMC/ndcloverrat1:ndrat_heatbath/eigsolveQuda
[...]
# QUDA: ********************************
# QUDA: **** START QUDA EIGENSOLVER ****
# QUDA: ********************************
# QUDA: spectrum SR
# QUDA: tol 1.0000e-14
# QUDA: n_conv 1
# QUDA: n_ev 1
# QUDA: n_kr 96
# QUDA: polyDeg 128
# QUDA: a-min 0.001000
# QUDA: a-max 4.000000
# QUDA: Resizing kSpace to 144 vectors
# QUDA: 0000 converged eigenvalues at restart iter 0001
# QUDA: 0000 converged eigenvalues at restart iter 0002
# QUDA: Resizing kSpace to 145 vectors
# QUDA: 0001 converged eigenvalues at restart iter 0003
# QUDA: TRLM computed the requested 1 vectors in 3 restart steps and 192 OP*x operations.
# QUDA: RitzValue[0000]: (+7.9066095102425749e-01, +0.0000000000000000e+00) residual 4.0309354968575832e-17
# QUDA: Eval[0000] = (+1.1364231284240960e-04,+1.4837671236894835e-21) ||+1.1364231284240960e-04|| Residual = +1.7886132427934613e-15
# QUDA: ********************************
# QUDA: ***** END QUDA EIGENSOLVER *****
# QUDA: ********************************
# TM_QUDA: Time for eigsolveQuda 7.709653e+01 s level: 2 proc_id: 0 /HMC/ndcloverrat1:ndrat_heatbath/eigsolveQuda

It seems that with the chosen settings

  eig_param.use_poly_acc = QUDA_BOOLEAN_TRUE;
  eig_param.poly_deg = 128;
  eig_param.a_min = 1e-3;
  eig_param.a_max = 4;

the solve for the largest eigenvalue does not work. One likely needs to disable polynomial acceleration for that solve; since its time to solution is very short anyway, that is perfectly fine.

00000000 2.41792e-05 2.41792e-05 8.00000e-06 1.00000e+00

The timing reflects this fact:

smallest: 77 seconds
"largest": 77 seconds

TOTAL: 154 seconds

Proposal for a default:

double eigsolveQuda(..){
[...]
  eig_param.use_poly_acc = maxmin == 1 ? QUDA_BOOLEAN_FALSE : QUDA_BOOLEAN_TRUE;
  eig_param.poly_deg = 128;
  eig_param.a_min = 1e-3;
  eig_param.a_max = 4;
[...]
}

We might also need some new parameters to make the degree, a_min and a_max user-configurable.
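In the input file this might look something like the following; the parameter names EigPolyDeg, EigAmin and EigAmax are placeholders I am making up for illustration (only UseExternalEigSolver exists in this PR), assuming the same value convention as UseExternalInverter:

BeginMonomial NDCLOVERRAT
  # ... usual monomial parameters ...
  UseExternalEigSolver = quda
  EigPolyDeg = 128 # hypothetical parameter
  EigAmin = 1e-3   # hypothetical parameter
  EigAmax = 4.0    # hypothetical parameter
EndMonomial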

# QUDA: Orthonormalising initial guess
# QUDA: ********************************
# QUDA: **** START QUDA EIGENSOLVER ****
# QUDA: ********************************
# QUDA: TRLM computed the requested 1 vectors in 3 restart steps and 192 OP*x operations.
# QUDA: RitzValue[0000]: (+7.9066095102425749e-01, +0.0000000000000000e+00) residual 4.0309354968575832e-17
# QUDA: Eval[0000] = (+1.1364231284240960e-04,+1.4837671236894835e-21) ||+1.1364231284240960e-04|| Residual = +1.7886132427934613e-15
# QUDA: ********************************
# QUDA: ***** END QUDA EIGENSOLVER *****
# QUDA: ********************************
# TM_QUDA: Time for eigsolveQuda 7.718751e+01 s level: 2 proc_id: 0 /HMC/ndcloverrat1:ndrat_heatbath/eigsolveQuda
[...]
# QUDA: Orthonormalising initial guess
# QUDA: ********************************
# QUDA: **** START QUDA EIGENSOLVER ****
# QUDA: ********************************
# QUDA: TRLM computed the requested 1 vectors in 4 restart steps and 240 OP*x operations.
# QUDA: Eval[0000] = (+3.0261608378473412e+00,-6.2242957455787945e-17) ||+3.0261608378473412e+00|| Residual = +1.0169756438652394e-14
# QUDA: ********************************
# QUDA: ***** END QUDA EIGENSOLVER *****
# QUDA: ********************************
# TM_QUDA: Time for eigsolveQuda 5.185351e+00 s level: 2 proc_id: 0 /HMC/ndcloverrat1:ndrat_heatbath/eigsolveQuda

smallest: 77 seconds
largest: 5 seconds

TOTAL: 82 seconds

I'm pretty sure that this can be improved further by lowering the polynomial degree and optimising a_min, as described here: https://github.com/lattice/quda/wiki/QUDA%27s-eigensolvers#using-chebyshev-in-your-computations
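A possible recipe, following the wiki (a sketch: the eigsolveQuda argument list is elided as above, and the safety factor and degree are guesses that would need tuning per ensemble):

/* hypothetical tuning sketch, not part of this PR: run the cheap
 * unaccelerated LR solve first and use its result to set the
 * Chebyshev window for the expensive SR solve */
double ev_max = eigsolveQuda(/* maxmin = */ 1 /* , ... */); /* ~5 s, no poly accel */

eig_param.use_poly_acc = QUDA_BOOLEAN_TRUE;
eig_param.a_max = 1.1 * ev_max;  /* safety margin above the largest eigenvalue   */
eig_param.a_min = 1e-3;          /* must stay above the eigenvalues we are after */
eig_param.poly_deg = 64;         /* try halving 128 and check convergence holds  */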

@kostrzewa (Member) commented May 24, 2023

If you have some time, maybe you can take a look at / merge #564, try to make the parameters user-configurable, and/or test with some production ensembles at various lattice spacings whether the hard-coded defaults are okay for now. We can certainly live with 80-ish seconds for the eigensolver in the HMC.

aniketsen and others added 2 commits on June 7, 2023 10:26:

- first stab at using polynomial acceleration for the ND eigensolver
- …urb global inv_param when changing precision, adjust indentation in one place

eig_param.invert_param = &inv_param;

// need our own QudaInvertParam for passing the operator properties
@kostrzewa (Member) commented:
@aniketsen note that I've added this just now because the change of precision is performed in the global inv_param, which might have unforeseen consequences in certain situations

@kostrzewa (Member) added:
this way, the global inv_param is never touched via the pointer
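In other words, the pattern is (a sketch with an assumed local variable name; QudaInvertParam is a plain C struct, so assignment makes an independent copy):

QudaInvertParam eig_invert_param = inv_param;              /* local copy of the global struct */
eig_invert_param.cuda_prec = QUDA_DOUBLE_PRECISION;        /* precision changes stay local    */
eig_invert_param.cuda_prec_sloppy = QUDA_DOUBLE_PRECISION;
eig_param.invert_param = &eig_invert_param;                /* global inv_param never modified */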

@kostrzewa (Member) commented:

@aniketsen slowly getting back into finalizing this and other things

As far as I can tell there is currently no way to actually build test_eigsolveQuda (I don't see any changes to the build system). In test_eigsolveQuda.c I'm also a bit confused by the set of monomials listed, since ComputeEVFreq can only be set for a subset of these (from read_input.l):

2582 <NDPOLYMONOMIAL,CLPOLYMONOMIAL,NDRATMONOMIAL,NDRATCORMONOMIAL,NDCLRATMONOMIAL,NDCLRATCORMONOMIAL,RATMONOMIAL,RATCORMONOMIAL,CLRATMONOMIAL,CLRATCORMONOMIAL>
[...]
2597   {SPC}*ComputeEVFreq{EQL}{DIGIT}+ {
2598     sscanf(yytext, " %[a-zA-Z] = %d", name, &a);
2599     mnl->rec_ev = a;
2600     if(myverbose!=0) printf("  Frequency for computing EV's set to %d in line %d monomial %d\n", mnl->rec_ev, line_of_file, current_monomial);
2601   }

If you want, we can just leave test_eigsolveQuda out of this PR: it doesn't really offer the kind of stand-alone eigensolver interface that I was hoping for; rather, it focuses just on the min/max evals, and I'm not sure whether it actually works correctly. When you cycle through the monomials:

361   for(j = 0; j < no_monomials; j++) {
362     if( (monomial_list[j].type == NDPOLY) || (monomial_list[j].type == NDDETRATIO)
363     || (monomial_list[j].type == NDCLOVER) || (monomial_list[j].type == NDRAT)
364     || (monomial_list[j].type == NDCLOVERRAT) || (monomial_list[j].type == NDRATCOR)
365     || (monomial_list[j].type == NDCLOVERRATCOR) || (monomial_list[j].type == NDCLOVERDETRATIO) ) {
366       if( (monomial_list[j].rec_ev != 0) ) {
367         monomial * mnl = &monomial_list[j];

you don't actually set any of the global parameters based on the monomial and instead immediately call eigsolveQuda. This means that things like g_kappa and g_mu will simply be unset, right?

If you look at something like ndrat_monomial.c, this is what happens before the eigensolver driver is called:

205 void ndrat_heatbath(const int id, hamiltonian_field_t * const hf) {
206   monomial * mnl = &monomial_list[id];
207   tm_stopwatch_push(&g_timers, __func__, mnl->name);
208   nd_set_global_parameter(mnl);
209   mnl->iter1 = 0;
210   if(mnl->type == NDCLOVERRAT) {
211     init_sw_fields();
212     sw_term((const su3**)hf->gaugefield, mnl->kappa, mnl->c_sw);
213     sw_invert_nd(mnl->mubar*mnl->mubar - mnl->epsbar*mnl->epsbar);
214     copy_32_sw_fields();
215   }
216   // we measure before the trajectory! 
217   if((mnl->rec_ev != 0) && (hf->traj_counter%mnl->rec_ev == 0)) {
218     if(mnl->type != NDCLOVERRAT) phmc_compute_ev(hf->traj_counter-1, id, &Qtm_pm_ndbipsi);
219     else phmc_compute_ev(hf->traj_counter-1, id, &Qsw_pm_ndbipsi);
220   }

where nd_set_global_parameter is:

 52 void nd_set_global_parameter(monomial * const mnl) {
 53 
 54   g_mubar = mnl->mubar;
 55   g_epsbar = mnl->epsbar;
 56   g_kappa = mnl->kappa;
 57   g_c_sw = mnl->c_sw;
 58   boundary(g_kappa);
 59   phmc_cheb_evmin = mnl->EVMin;
 60   phmc_invmaxev = mnl->EVMaxInv;
 61   phmc_cheb_evmax = mnl->EVMax;
 62   phmc_Cpol = 1.;
 63   // used for preconditioning in cloverdetrat
 64   g_mu3 = 0.;
 65 
 66   return;
 67 }
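So the loop in test_eigsolveQuda.c would need something along these lines before calling the eigensolver (a sketch built from the excerpts above; the eigsolveQuda arguments are elided, and I am using the global g_gauge_field since the stand-alone program has no hamiltonian_field_t):

for(j = 0; j < no_monomials; j++) {
  monomial * mnl = &monomial_list[j];
  if( (mnl->type == NDCLOVERRAT) /* || ... the other ND-type monomials ... */ ) {
    if(mnl->rec_ev != 0) {
      /* transfer kappa, mubar, epsbar, c_sw, ... from the monomial to the
         globals, exactly as the heatbath drivers do */
      nd_set_global_parameter(mnl);
      if(mnl->type == NDCLOVERRAT) {
        init_sw_fields();
        sw_term((const su3**)g_gauge_field, mnl->kappa, mnl->c_sw);
        sw_invert_nd(mnl->mubar*mnl->mubar - mnl->epsbar*mnl->epsbar);
        copy_32_sw_fields();
      }
      eigsolveQuda(/* ... */);
    }
  }
}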

@aniketsen (Contributor, Author) commented Sep 19, 2023

> As far as I can tell there is currently no way to actually build test_eigsolveQuda (I don't see any changes to the build system). In test_eigsolveQuda.c I'm also a bit confused by the set of monomials listed, since ComputeEVFreq can only be set for a subset of these (from read_input.l):

Yes, I did not push the changes to the build system. I had added all the monomials for a more general approach, but you are right that ComputeEVFreq is not defined for all of them. In those cases, however, the variable rec_ev is set to zero by default and the eigensolver is simply not called.

> If you want, we can just leave test_eigsolveQuda out of this PR: it doesn't really offer the kind of stand-alone eigensolver interface that I was hoping for; rather, it focuses just on the min/max evals, and I'm not sure whether it actually works correctly. When you cycle through the monomials:

Yes, this is a mistake. For some reason I thought init_monomials also initialized the global parameters, but of course one needs to call nd_set_global_parameter when cycling through the different monomials.

You can just leave this program out. I agree that it does not perform a very useful function at the moment beyond calling the eigensolver for the different monomials, and you can just as easily get the same job done by setting ComputeEVFreq to 1 and running the standard hmc program.
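For completeness, that amounts to adding (illustrative monomial block, all other parameters elided):

BeginMonomial NDCLOVERRAT
  # ... usual monomial parameters ...
  ComputeEVFreq = 1
EndMonomial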

@kostrzewa merged commit 49b0dea into quda_work on Oct 4, 2023. 3 checks passed.